Motivation
- Adding a conjugate prior over θ reduces the number of parameters from pLSA's MK + KV to K + KV, which is independent of the number of documents and makes the model less prone to overfitting.
- In pLSA, besides the β matrix, we learn M document-specific points for θ; in LDA we learn a single Dirichlet over θ, so the trained model generalizes easily to unseen documents.
Specification
original version:
smoothed version:
Parameters for original version:
- K: number of topics
- M: number of documents
- Nm: length of m-th document
- V: size of vocabulary
- α: Dirichlet prior
- β: K×V, word distribution of topics
- θ: M×K, topic distribution of documents
Parameters for smoothed version:
$$
\begin{aligned}
\alpha &\sim \text{a Dirichlet hyperprior, either a constant or a random variable} \\
\beta &\sim \text{a Dirichlet hyperprior, either a constant or a random variable} \\
\theta_{d=1\dots M} &\sim \operatorname{Dirichlet}_K(\alpha) \\
\phi_{k=1\dots K} &\sim \operatorname{Dirichlet}_V(\beta) \\
z_{d=1\dots M,\,n=1\dots N_d} &\sim \operatorname{Categorical}_K(\theta_d) \\
w_{d=1\dots M,\,n=1\dots N_d} &\sim \operatorname{Categorical}_V(\phi_{z_{dn}})
\end{aligned}
$$
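As a concrete illustration, here is a minimal NumPy sketch of the smoothed generative process. The symmetric hyperparameter values and all variable names are illustrative; the hyperprior on ϕ is called `eta` here to avoid clashing with the β matrix of the original version.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, M = 3, 20, 5            # topics, vocabulary size, documents
N = [50] * M                  # document lengths
alpha, eta = 0.1, 0.01        # symmetric Dirichlet hyperparameters (illustrative values)

# One Dirichlet_V draw per topic: the topic-word distributions (phi_k in the smoothed model)
phi = rng.dirichlet(np.full(V, eta), size=K)           # shape (K, V)

docs = []
for d in range(M):
    theta_d = rng.dirichlet(np.full(K, alpha))         # Dirichlet_K: topic mixture of document d
    z_d = rng.choice(K, size=N[d], p=theta_d)          # Categorical_K: topic of each token
    w_d = [int(rng.choice(V, p=phi[z])) for z in z_d]  # Categorical_V: observed words
    docs.append(w_d)
```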
In the original version, the joint distribution is:
$$
p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha)\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
$$
The marginal distribution of a document:
$$
p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha)\left(\prod_{n=1}^{N}\sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\right) d\theta
$$
Posterior distribution of hidden variables:
$$
p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}
$$
This posterior is intractable to compute, due to the coupling between θ and β.
Variational distribution
Here we introduce a variational distribution over the latent variables, which factorizes as:
$$
q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma)\prod_{n=1}^{N} q(z_n \mid \phi_n)
$$
and optimize:
$$
(\gamma^*, \phi^*) = \operatorname*{arg\,min}_{\gamma,\phi}\; D_{\mathrm{KL}}\big(q(\theta, z \mid \gamma, \phi)\,\|\, p(\theta, z \mid w, \alpha, \beta)\big)
$$
to minimize the difference between the variational distribution and the true posterior distribution.
Manipulating the objective above, we get:
$$
D_{\mathrm{KL}}(q \,\|\, p) + \mathcal{L}(\gamma, \phi; \alpha, \beta) = \log p(w \mid \alpha, \beta) = \text{constant w.r.t. } \gamma, \phi
$$
where
$$
\mathcal{L}(\gamma, \phi; \alpha, \beta) = \mathbb{E}_q[\log p(\theta, z, w \mid \alpha, \beta)] - \mathbb{E}_q[\log q(\theta, z)]
$$
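This identity can be checked directly: since log p(w|α,β) does not depend on θ or z, we may take its expectation under q and split the log ratio (a standard decomposition, spelled out here for completeness):

$$
\begin{aligned}
\log p(w \mid \alpha, \beta)
&= \mathbb{E}_q\!\left[\log \frac{p(\theta, z, w \mid \alpha, \beta)}{p(\theta, z \mid w, \alpha, \beta)}\right] \\
&= \mathbb{E}_q\!\left[\log \frac{p(\theta, z, w \mid \alpha, \beta)}{q(\theta, z)}\right]
 + \mathbb{E}_q\!\left[\log \frac{q(\theta, z)}{p(\theta, z \mid w, \alpha, \beta)}\right] \\
&= \mathcal{L}(\gamma, \phi; \alpha, \beta) + D_{\mathrm{KL}}(q \,\|\, p)
\end{aligned}
$$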
So minimizing the KL divergence is equivalent to maximizing the function L, which is a lower bound on log p(w|α,β):
$$
(\gamma^*, \phi^*) = \operatorname*{arg\,max}_{\gamma,\phi}\; \mathcal{L}(\gamma, \phi; \alpha, \beta)
$$
Parameter estimation
Variational EM algorithm:
- E-step: maximize the lower bound L(γ,ϕ;α,β) with respect to the variational parameters γ and ϕ
- M-step: maximize the bound with respect to the model parameters α and β (the full loop is sketched in code right after this list)
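A high-level sketch of this loop in Python (NumPy), assuming the `e_step`, `m_step_beta`, and `m_step_alpha` helpers sketched in the subsections below; all names, initializations, and the iteration count are illustrative:

```python
import numpy as np

def variational_em(docs, K, V, n_iter=50, seed=0):
    """docs[d]: list of word ids in [0, V).  Returns estimated (alpha, beta)."""
    rng = np.random.default_rng(seed)
    alpha = np.full(K, 1.0)                            # Dirichlet prior over topic mixtures
    beta = rng.dirichlet(np.full(V, 1.0), size=K)      # K x V topic-word distributions

    for _ in range(n_iter):
        # E-step: fit variational parameters (gamma_d, phi_d) for every document
        gammas, phis = zip(*(e_step(doc, alpha, beta) for doc in docs))
        # M-step: closed-form update for beta, Newton updates for alpha
        beta = m_step_beta(docs, phis, K, V)
        alpha = m_step_alpha(alpha, np.stack(gammas))
    return alpha, beta
```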
E-step: variational inference
A few more steps:
$$
\begin{aligned}
\mathcal{L}(\gamma, \phi; \alpha, \beta)
&= \mathbb{E}_q[\log p(\theta, z, w \mid \alpha, \beta)] - \mathbb{E}_q[\log q(\theta, z)] \\
&= \mathbb{E}_q[\log p(\theta \mid \alpha)] + \mathbb{E}_q[\log p(z \mid \theta)] + \mathbb{E}_q[\log p(w \mid z, \beta)]
 - \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log q(z)]
\end{aligned}
$$
Struggling through heavy math to compute each term, we finally get (ψ is the digamma function):
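For reference, the expanded per-document bound (as derived in the original LDA paper by Blei, Ng, and Jordan, 2003) is:

$$
\begin{aligned}
\mathcal{L}(\gamma, \phi; \alpha, \beta)
&= \log\Gamma\Big(\sum_{j=1}^{K}\alpha_j\Big) - \sum_{i=1}^{K}\log\Gamma(\alpha_i)
 + \sum_{i=1}^{K}(\alpha_i - 1)\Big(\psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big)\Big) \\
&\quad + \sum_{n=1}^{N}\sum_{i=1}^{K}\phi_{ni}\Big(\psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big)\Big)
 + \sum_{n=1}^{N}\sum_{i=1}^{K}\sum_{j=1}^{V}\phi_{ni}\, w_n^j \log\beta_{ij} \\
&\quad - \log\Gamma\Big(\sum_{j=1}^{K}\gamma_j\Big) + \sum_{i=1}^{K}\log\Gamma(\gamma_i)
 - \sum_{i=1}^{K}(\gamma_i - 1)\Big(\psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big)\Big) \\
&\quad - \sum_{n=1}^{N}\sum_{i=1}^{K}\phi_{ni}\log\phi_{ni}
\end{aligned}
$$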
Taking derivatives of this function and setting them to zero yields the update formulas.
The variational inference algorithm updates γ and ϕ alternately until convergence:
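The standard per-document updates (again following Blei, Ng, and Jordan, 2003) are

$$
\phi_{ni} \;\propto\; \beta_{i w_n}\,\exp\!\Big(\psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big)\Big),
\qquad
\gamma_i = \alpha_i + \sum_{n=1}^{N}\phi_{ni}
$$

and a minimal NumPy sketch of this E-step follows; the function name `e_step`, the initialization, and the convergence threshold are illustrative:

```python
import numpy as np
from scipy.special import psi  # digamma

def e_step(doc, alpha, beta, tol=1e-4, max_iter=100):
    """Variational inference for one document.
    doc: list of word ids; alpha: (K,); beta: (K, V), rows sum to 1."""
    N, K = len(doc), len(alpha)
    phi = np.full((N, K), 1.0 / K)                 # q(z_n) for each token
    gamma = alpha + N / K                          # initial gamma_i = alpha_i + N/K
    for _ in range(max_iter):
        # phi_{ni} ∝ beta_{i, w_n} * exp(psi(gamma_i) - psi(sum_j gamma_j));
        # the psi(sum_j gamma_j) term is constant over i and cancels in the normalization
        log_phi = np.log(beta[:, doc].T + 1e-100) + psi(gamma)
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_{ni}
        new_gamma = alpha + phi.sum(axis=0)
        if np.abs(new_gamma - gamma).sum() < tol:
            gamma = new_gamma
            break
        gamma = new_gamma
    return gamma, phi
```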
M-step
Maximize L(γ,ϕ;α,β) with respect to β, with Lagrange multipliers λi enforcing that each row of β sums to one:
$$
\mathcal{L}_{\beta} = \sum_{d=1}^{M}\sum_{n=1}^{N_d}\sum_{i=1}^{K}\sum_{j=1}^{V}\phi_{dni}\, w_{dn}^{j}\log\beta_{ij} + \sum_{i=1}^{K}\lambda_i\Big(\sum_{j=1}^{V}\beta_{ij} - 1\Big)
$$
Taking the derivative with respect to βij and setting it to zero:
$$
\beta_{ij} \propto \sum_{d=1}^{M}\sum_{n=1}^{N_d}\phi_{dni}\, w_{dn}^{j}
$$
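A minimal sketch of this update, accumulating the expected word-topic counts from the E-step outputs and normalizing each row of β; the function name `m_step_beta` is illustrative:

```python
import numpy as np

def m_step_beta(docs, phis, K, V, eps=1e-12):
    """docs[d]: list of word ids; phis[d]: (N_d, K) variational phi from the E-step."""
    counts = np.zeros((K, V))
    for doc, phi in zip(docs, phis):
        for n, w in enumerate(doc):
            counts[:, w] += phi[n]      # beta_{ij} accumulates phi_{dni} * w_{dn}^j
    return counts / (counts.sum(axis=1, keepdims=True) + eps)   # normalize rows to sum to 1
```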
Maximize L(γ,ϕ;α,β) with respect to α:
$$
\mathcal{L}_{\alpha} = \sum_{d=1}^{M}\left[\log\Gamma\Big(\sum_{j=1}^{K}\alpha_j\Big) - \sum_{i=1}^{K}\log\Gamma(\alpha_i) + \sum_{i=1}^{K}(\alpha_i - 1)\Big(\psi(\gamma_{di}) - \psi\Big(\sum_{j=1}^{K}\gamma_{dj}\Big)\Big)\right]
$$
Taking the derivative with respect to αi:
$$
\frac{\partial \mathcal{L}}{\partial \alpha_i} = M\Big(\psi\Big(\sum_{j=1}^{K}\alpha_j\Big) - \psi(\alpha_i)\Big) + \sum_{d=1}^{M}\Big(\psi(\gamma_{di}) - \psi\Big(\sum_{j=1}^{K}\gamma_{dj}\Big)\Big)
$$
It is difficult to compute αi in closed form by setting the derivative to zero, so we compute the Hessian matrix (δ(i,j)=1 if i=j and 0 otherwise):
$$
\frac{\partial^2 \mathcal{L}}{\partial \alpha_i\,\partial \alpha_j} = M\Big(\psi'\Big(\sum_{k=1}^{K}\alpha_k\Big) - \delta(i,j)\,\psi'(\alpha_i)\Big)
$$
and feed this Hessian matrix and the gradient into Newton's method to obtain α.
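A minimal sketch of these Newton updates, using the γ matrix collected in the E-step. The generic `np.linalg.solve` call ignores the special diagonal-plus-constant structure of the Hessian that Blei et al. exploit for a linear-time update, and the positivity safeguard is a simplification; the function name `m_step_alpha` is illustrative:

```python
import numpy as np
from scipy.special import psi, polygamma

def m_step_alpha(alpha, gammas, n_iter=20):
    """Newton updates for alpha.  alpha: (K,); gammas: (M, K) from the E-step."""
    M, K = gammas.shape
    # Sufficient statistic: sum_d ( psi(gamma_di) - psi(sum_j gamma_dj) )
    ss = (psi(gammas) - psi(gammas.sum(axis=1, keepdims=True))).sum(axis=0)
    for _ in range(n_iter):
        # Gradient:  g_i = M (psi(sum_j alpha_j) - psi(alpha_i)) + ss_i
        g = M * (psi(alpha.sum()) - psi(alpha)) + ss
        # Hessian:   H_ij = M (psi'(sum_j alpha_j) - delta(i,j) psi'(alpha_i))
        H = M * (polygamma(1, alpha.sum()) - np.diag(polygamma(1, alpha)))
        # Newton step for maximization: alpha <- alpha - H^{-1} g
        alpha = alpha - np.linalg.solve(H, g)
        # crude safeguard to keep alpha positive; real implementations
        # use a backtracking line search or update in log space
        alpha = np.maximum(alpha, 1e-6)
    return alpha
```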
Gibbs Sampling
(for the smoothed version)
Theoretical analysis:
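With θ and ϕ integrated out (collapsed Gibbs sampling), the full conditional of a single topic assignment takes the standard form (see, e.g., Darling, 2011); writing n⁻ for counts that exclude the token currently being resampled:

$$
p(z_{dn} = k \mid z^{-}, w, \alpha, \beta) \;\propto\;
\big(n^{-}_{d,k} + \alpha_k\big)\cdot
\frac{n^{-}_{k,w_{dn}} + \beta_{w_{dn}}}{\sum_{v=1}^{V}\big(n^{-}_{k,v} + \beta_v\big)}
$$

where n_{d,k} counts the tokens of document d assigned to topic k and n_{k,v} counts the tokens of word v assigned to topic k across the corpus.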
Both factors take this simple form thanks to the conjugate Dirichlet priors. Note that the normalizer of the first term is omitted, because its sum involves only the length of the document, which is fixed, while the second denominator may change after each update.
Algorithm:
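A minimal NumPy sketch of the collapsed Gibbs sampler, assuming symmetric scalar hyperparameters alpha and beta; the function name, default values, and fixed iteration count are illustrative:

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampler.  docs[d]: list of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # tokens in document d assigned to topic k
    n_kv = np.zeros((K, V))           # tokens of word v assigned to topic k
    n_k = np.zeros(K)                 # total tokens assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initial assignments

    for d, doc in enumerate(docs):    # fill in the count tables
        for n, w in enumerate(doc):
            n_dk[d, z[d][n]] += 1
            n_kv[z[d][n], w] += 1
            n_k[z[d][n]] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # remove the current token from the counts
                n_dk[d, k] -= 1
                n_kv[k, w] -= 1
                n_k[k] -= 1
                # p(z = k | rest) ∝ (n_dk + alpha) * (n_kv + beta) / (n_k + V * beta)
                p = (n_dk[d] + alpha) * (n_kv[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # add it back under the newly sampled topic
                z[d][n] = k
                n_dk[d, k] += 1
                n_kv[k, w] += 1
                n_k[k] += 1
    return z, n_dk, n_kv
```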
For comparisons and discussion of MCMC versus Variational Bayes, see Variational Bayes.
Extensions
Relaxing the assumptions:
- order of words doesn't matter ("Bag of words" assumption)
- order of documents doesn't matter => time-evolving, dynamic topic model
- the number of topics is assumed known, fixed and flat => Bayesian nonparametric topic model
- topics are not correlated => correlated topic model
- ...
Incorporating metadata: authors, links, other labels (supervised), ...
Other problems: model checking, visualization, data discovery, ...
References
Blei, David M. "Probabilistic topic models." Communications of the ACM 55.4 (2012): 77-84.
"Machine Learning" Lecture 19: http://www.umiacs.umd.edu/~jbg/teaching/CSCI_5622/
"Probabilistic Models for Unsupervised Learning" Lecture 5: http://home.cse.ust.hk/~lzhang/teach/6931a/
Dirichlet-Multinomial Distribution: https://en.wikipedia.org/wiki/Dirichlet-multinomial_distribution#A_combined_example:_LDA_topic_models
Darling, William M. "A theoretical and practical implementation tutorial on topic modeling and gibbs sampling." Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. 2011.